
    Minimizing Communication for Eigenproblems and the Singular Value Decomposition

    Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In \cite{BDHS10} lower bounds were presented on the amount of communication required for essentially all $O(n^3)$-like algorithms for linear algebra, including eigenvalue problems and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform asymptotically more communication than these lower bounds require. In this paper we present parallel and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms that do attain these lower bounds, and analyze their convergence and communication costs.
    Comment: 43 pages, 11 figures
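    For scale (a paraphrase of the \cite{BDHS10} bounds, not this paper's exact statement): on a sequential machine with fast memory of size $M$, any $O(n^3)$-like algorithm must move $W = \Omega(n^3/\sqrt{M})$ words and send $S = \Omega(n^3/M^{3/2})$ messages; on $P$ processors holding $\Theta(n^2/P)$ words each, the per-processor bound is $W = \Omega(n^2/\sqrt{P})$. "Attaining the lower bounds" throughout these abstracts means matching these quantities, in some cases up to polylogarithmic factors.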

    Communication-optimal Parallel and Sequential Cholesky Decomposition

    Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional $O(n^3)$ matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [13, 14, 15], this gives a set of communication-optimal algorithms for $O(n^3)$ implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.
    Comment: 29 pages, 2 tables, 6 figures
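    To make the flavor of a communication-aware factorization concrete, here is a minimal sketch (for illustration only; it is not the paper's algorithm and ignores its choice of data structures) of right-looking blocked Cholesky in Python/NumPy. In the two-level model, choosing the block size b so that about three b-by-b blocks fit in fast memory (b on the order of $\sqrt{M/3}$) is what reduces data movement relative to the unblocked loop:

        import numpy as np
        from scipy.linalg import solve_triangular

        def blocked_cholesky(A, b):
            """Right-looking blocked Cholesky sketch: returns lower-triangular
            L with A = L @ L.T, processing b-by-b blocks. The block size b
            stands in for tuning to the fast-memory size M."""
            A = A.astype(float, copy=True)
            n = A.shape[0]
            for k in range(0, n, b):
                kb = min(b, n - k)
                # Factor the diagonal block; it fits in fast memory.
                A[k:k+kb, k:k+kb] = np.linalg.cholesky(A[k:k+kb, k:k+kb])
                L_kk = A[k:k+kb, k:k+kb]
                if k + kb < n:
                    # Panel solve: find L_panel with L_panel @ L_kk.T = A_panel.
                    A[k+kb:, k:k+kb] = solve_triangular(
                        L_kk, A[k+kb:, k:k+kb].T, lower=True).T
                    panel = A[k+kb:, k:k+kb]
                    # Rank-kb update of the trailing submatrix (a tuned code
                    # would update only its lower triangle).
                    A[k+kb:, k+kb:] -= panel @ panel.T
            return np.tril(A)

    As a sanity check, for a random symmetric positive definite A, np.allclose(blocked_cholesky(A, 32), np.linalg.cholesky(A)) should hold.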

    A 3D Parallel Algorithm for QR Decomposition

    Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.
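    One widely used building block in this line of communication-avoiding QR work is TSQR, which computes the R factor of a tall-skinny matrix by a tree of small QR factorizations; the sketch below (illustrative of the family, not the 3D algorithm of this paper) uses a binary tree, and flattening or deepening that tree is the same kind of bandwidth/latency knob the abstract describes:

        import numpy as np

        def tsqr_r(blocks):
            """Binary-tree TSQR sketch: given a tall-skinny matrix split into
            row blocks (each with at least as many rows as columns), return
            its R factor. Each tree level corresponds to one round of
            messages in a parallel implementation."""
            Rs = [np.linalg.qr(b, mode='r') for b in blocks]
            while len(Rs) > 1:
                nxt = []
                for i in range(0, len(Rs) - 1, 2):
                    # Stack two R factors and re-factor: one message per pair.
                    nxt.append(np.linalg.qr(np.vstack((Rs[i], Rs[i + 1])), mode='r'))
                if len(Rs) % 2:
                    nxt.append(Rs[-1])  # odd block rides up to the next level
                Rs = nxt
            return Rs[0]

    Up to the signs of its rows, the result agrees with np.linalg.qr(np.vstack(blocks), mode='r').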

    Faster all-pairs shortest paths via circuit complexity

    We present a new randomized method for computing the min-plus product (a.k.a. tropical product) of two $n \times n$ matrices, yielding a faster algorithm for solving the all-pairs shortest path problem (APSP) in dense $n$-node directed graphs with arbitrary edge weights. On the real RAM, where additions and comparisons of reals are unit cost (but all other operations have typical logarithmic cost), the algorithm runs in time $\frac{n^3}{2^{\Omega(\log n)^{1/2}}}$ and is correct with high probability. On the word RAM, the algorithm runs in $n^3/2^{\Omega(\log n)^{1/2}} + n^{2+o(1)}\log M$ time for edge weights in $([0,M] \cap {\mathbb Z})\cup\{\infty\}$. Prior algorithms used either $n^3/(\log^c n)$ time for various $c \leq 2$, or $O(M^{\alpha}n^{\beta})$ time for various $\alpha > 0$ and $\beta > 2$. The new algorithm applies a tool from circuit complexity, namely the Razborov-Smolensky polynomials for approximately representing ${\sf AC}^0[p]$ circuits, to efficiently reduce a matrix product over the $(\min,+)$ algebra to a relatively small number of rectangular matrix products over ${\mathbb F}_2$, each of which is computable using a particularly efficient method due to Coppersmith. We also give a deterministic version of the algorithm running in $n^3/2^{\log^{\delta} n}$ time for some $\delta > 0$, which utilizes the Yao-Beigel-Tarui translation of ${\sf AC}^0[m]$ circuits into "nice" depth-two circuits.
    Comment: 24 pages. Updated version now has slightly faster running time. To appear in ACM Symposium on Theory of Computing (STOC), 2014
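    For readers new to the primitive being accelerated, the min-plus product and its use in APSP by repeated squaring fit in a few lines; this is the naive $O(n^3)$-per-product baseline, not the paper's method:

        import numpy as np

        def min_plus(A, B):
            """Naive min-plus (tropical) product: C[i, j] = min_k A[i, k] + B[k, j]."""
            n = A.shape[0]
            C = np.full((n, n), np.inf)
            for k in range(n):
                C = np.minimum(C, A[:, k][:, None] + B[k, :][None, :])
            return C

        def apsp(D):
            """All-pairs shortest paths by repeated min-plus squaring of the
            weight matrix D (np.inf for absent edges, 0 on the diagonal);
            assumes no negative cycles. Uses O(log n) products."""
            n = D.shape[0]
            hops = 1
            while hops < n - 1:
                D = min_plus(D, D)  # after t squarings: paths of <= 2^t edges
                hops *= 2
            return D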